Various Insights on Social Aspects in Berlin#
Group report for the course Data Visualization by:
Robert Wienröder
Leon Ostermann
Christian Tesch
Abstract#
Berlin’s Open Data Platform is used to visualize and concisely analyze various social aspects of the city. Evaluating heterogeneous spaces with approximately 7000 inhabitants each reveals a positive correlation between the rates of child poverty, social benefit recipients, and unemployed people. Furthermore, an investigation of crime occurrences shows regionally distinct distributions of crime frequency and crime types across Berlin.
1. Introduction#
In this report, we present the results of our group project for the course Data Visualization. Within the group, we agreed to explore data from our immediate surroundings that directly impact our everyday lives. Hence, we chose datasets from the Berlin Open Data Portal [1]. Berlin Open Data is an open-access data catalog of the Senate of Berlin and features an abundance of administrative and socioeconomic datasets.
With this approach, our goal is to gain insights into our daily surroundings and to compare our intuitive understanding of social aspects with a visual analysis of empirical data.
1.1. Data Selection#
We first chose approximately 30 datasets primarily centered around social aspects such as unemployment rates, income data, pollution, health status, and recreational areas. Besides that, we included some datasets from the traffic sector, covering road traffic accidents, traffic signal systems, and speed limits. Our intention was to uncover potential connections or correlations between these areas. However, we had to abandon this objective due to the fundamentally different data collection methods between social and traffic data.
Moreover, the geographical units, known as LOR (Lebensweltlich orientierte Räume or “life-oriented spaces”), underwent a complete overhaul in 2021, making most of the datasets from before 2021 unusable. Some exceptions existed, where data like registered crimes had been reconfigured to fit the new LOR divisions. Furthermore, we faced technical challenges with certain datasets that made them infeasible to visualize due to errors on the open data portal.
Eventually, we selected four datasets:
Kriminalitätsatlas [2]: crime data with about 20 distinct features regarding different types of crimes
Umweltgerechtigkeit [3]: dataset encompassing various burdens like pollution and noise
Monitoring Soziale Stadtentwicklung [4]: social indicators such as the unemployment rate
LOR Planungsräume [5]: geodata for the LOR districts
1.2. Data Preparation#
The data preprocessing phase was time-consuming due to several obstacles. Some of the challenges only became apparent upon data examination, which led to extensive preprocessing of several datasets even before we selected the datasets for analysis. The data arrived in various file formats, and in one instance, we had to manually extract information from the table displayed on the open data portal since the download button was non-functional.
Moreover, we had to eliminate numerous columns and rows that were irrelevant or could potentially skew the results. For example, some rows represented sums of sub-districts within the same district, leading to highly misleading plots if not removed from the analysis.
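The removal of aggregate rows mentioned above can be sketched as follows. This is a minimal illustration and not the code used in the project: the column names `key`, `name`, and `value`, the toy values, and the convention that a summary row carries a key that is a prefix of its sub-districts' keys are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy table: the row with key "01" is the sum of its
# sub-districts "0101" and "0102" (all names and values are invented).
raw = pd.DataFrame({
    "key": ["01", "0101", "0102", "02", "0201"],
    "name": ["District A (total)", "A-North", "A-South",
             "District B (total)", "B-Centre"],
    "value": [30, 10, 20, 7, 7],
})


def drop_aggregate_rows(df: pd.DataFrame, key_col: str = "key") -> pd.DataFrame:
    """Keep only rows whose key is not a proper prefix of another row's key."""
    keys = df[key_col].astype(str)
    is_aggregate = keys.apply(
        lambda k: any(other != k and other.startswith(k) for other in keys)
    )
    return df.loc[~is_aggregate].reset_index(drop=True)


clean = drop_aggregate_rows(raw)
print(clean["value"].sum())  # 37; the raw column would sum to 74
```

Without this step, every sub-district would be counted twice in sums and the summary rows would appear as extreme outliers in the plots.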
1.3. Remark#
Certain plots were not displayed as intended in the HTML file, appearing as static images without animations. This made it necessary to export these graphics as separate files and embed them directly into the HTML file. To ensure a consistent visual presentation, this was done for almost all relevant visualizations (the only exceptions were the 3D plots, which could not be embedded). The code remains accessible for review and yields the same results as the embedded images. Typical Python and R environments can be used to reproduce those images.
2. Visualization of Social Aspects and Concise Analysis#
2.1. Spatial Distribution of Environmental Justice and Child Poverty#
To get a first impression of our preprocessed data, we visualize the spatial distribution of two features within Berlin.
The first feature is the so-called environmental justice. Based on [6], we measure environmental justice by the count of burdens, where a burden means that the respective area falls into the worst category for one of five indicators: noise, air pollution, heat, green spaces, and social disadvantage. This results in a burden count between 0 and 5. The higher the count, the worse the situation in the area. The following interactive heatmap visualizes environmental justice by LOR:
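The counting rule described above can be sketched in a few lines. In this report the burden count is read directly from the preprocessed data, so the following only illustrates the definition. The category column names are taken from the preprocessed data used later in this report, but the coding of the categories (here `3` for the worst category) is an assumption for illustration.

```python
import pandas as pd

# Column names follow the preprocessed data used later in this report;
# the category coding (3 = worst) is an ASSUMPTION for illustration only.
CATEGORY_COLUMNS = [
    "Lärmbelastungs_Kategorie",
    "Luftbelastungs_Kategorie",
    "Thermische_Belastungs_Kategorie",
    "Grünflächenversorgungs_Kategorie",
    "Soziale_Benachteiligungs_Kategorie",
]
WORST_CATEGORY = 3  # assumed code for the worst category


def count_burdens(df: pd.DataFrame, worst: int = WORST_CATEGORY) -> pd.Series:
    """Count, per area, how many of the five indicators are in the worst category."""
    return df[CATEGORY_COLUMNS].eq(worst).sum(axis=1)


# Three invented areas
toy = pd.DataFrame(
    [[3, 3, 1, 2, 3], [1, 1, 1, 1, 1], [3, 3, 3, 3, 3]],
    columns=CATEGORY_COLUMNS,
)
toy["Anzahl_Mehrfachbelastungen"] = count_burdens(toy)
print(toy["Anzahl_Mehrfachbelastungen"].tolist())  # [3, 0, 5]
```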
Show code cell source
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
from IPython import display
import time
import warnings
warnings.filterwarnings("ignore")
complete_df = pd.read_pickle("../data/07_complete_df.pkl")
# Work on a copy so that the type conversion below does not touch a view of complete_df
mehrfachbelastungen_df = complete_df[["PLR_NAME", "geometry", "Anzahl_Mehrfachbelastungen"]].copy()
mehrfachbelastungen_df["Anzahl_Mehrfachbelastungen"] = mehrfachbelastungen_df["Anzahl_Mehrfachbelastungen"].astype(int)
mehrfachbelastungen_df = mehrfachbelastungen_df.rename(columns={"Anzahl_Mehrfachbelastungen": "Number of multiple burdens", "PLR_NAME": "Area"})
mehrfachbelastungen_gdf = gpd.GeoDataFrame(mehrfachbelastungen_df, geometry="geometry")
mehrfachbelastungen_gdf.explore(column="Number of multiple burdens", cmap="Spectral_r", legend=True)
The map allows for a fine-grained visualization, as hovering over a specific LOR displays its number of burdens. Qualitatively, we observe that LORs with a high number of burdens seem to be concentrated in the central part of Berlin.
The following heatmap visualizes the percentage of children living in poverty in each LOR:
Show code cell source
kinderarmut_df = complete_df[["PLR_NAME", "geometry", "Kinderarmut"]]
kinderarmut_df = kinderarmut_df.rename(columns={"Kinderarmut": "Percentage of children living in poverty", "PLR_NAME": "Area"})
kinderarmut_gdf = gpd.GeoDataFrame(kinderarmut_df, geometry="geometry")
kinderarmut_gdf.explore(column="Percentage of children living in poverty", cmap="YlOrRd", legend=True)
We see how child poverty is unevenly distributed across the area of Berlin. We can make out areas with extremely high child poverty next to areas with an extremely small percentage of child poverty.
2.2. Temporal Development of Various Crime Features#
Among the selected datasets in Section 1.1, the Kriminalitätsatlas stood out as the one with the highest usability. It provides about 20 relevant features. The dataset was divided into subsets for each year from 2013 to 2022. The data had been reconfigured to fit the newly introduced LOR divisions, which made it easily usable for our purpose.
2.2.1. Overall Crime#
Show code cell source
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
from IPython import display
import time
import warnings
warnings.filterwarnings("ignore")
tab10 = sns.color_palette("tab10")
set1 = sns.color_palette("Set1")
colors = ["#000099", set1[1], tab10[9], set1[2], "#99cc00", "#ffcc00", set1[4], "#ff0000", "#cc0000", "#990000", "#660000", "#000000"]
positions = [0.0, 0.03, 0.05, 0.07, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.45, 1.0]
custom_cmap = mcolors.LinearSegmentedColormap.from_list('custom_colormap', list(zip(positions, colors)))
custom_cmap
crime_dev = pd.read_pickle("../data/05_kriminalitaet_2013_bis_2022_hz.pkl")
crime_dev["Straftaten \n-insgesamt-"] = crime_dev["Straftaten \n-insgesamt-"].astype(int)
max_value = max(crime_dev["Straftaten \n-insgesamt-"])
min_value = min(crime_dev["Straftaten \n-insgesamt-"])
vmin, vmax = min_value, max_value
for year in range(2013, 2023):
    year_df = crime_dev[crime_dev['Year'] == year]
    straftaten_gesamt_df = year_df[["PLR_NAME", "geometry", "Straftaten \n-insgesamt-"]]
    straftaten_gesamt_df = straftaten_gesamt_df.rename(columns={"Straftaten \n-insgesamt-": "Crimes per 100.000 residents",
                                                                "PLR_NAME": "Area"})
    straftaten_gesamt_df = gpd.GeoDataFrame(straftaten_gesamt_df, geometry="geometry")
    fig, ax = plt.subplots()
    straftaten_gesamt_df.plot(column="Crimes per 100.000 residents", cmap=custom_cmap, legend=True,
                              ax=ax, vmin=vmin, vmax=vmax, edgecolor='None')
    #ax.set_title(f"Crimes per 100,000 residents \nper district in {year}", fontsize=20, color="blue", fontweight="bold")
    ax.set_title(f"Crimes per 100,000 residents \n in {year}", fontsize=14)
    ax.set_axis_off()
    plt.gcf().set_dpi(600)
    display.display(plt.gcf())
    display.clear_output(wait=True)
    time.sleep(2)
    plt.close(fig)
In Fig. 2.1 we showcase a heatmap similar to the two maps in Section 2.1. It presents the spatial distribution of the number of overall crimes per 100,000 residents per district over the past ten years (2013-2022).
Fig. 2.1 Number of crimes per 100,000 residents per district in the years 2013–2022.#
The frequency of overall crime seems to be concentrated in the central parts of Berlin. This makes us wonder whether there might be a general correlation with population density, and indeed, Section 2.3.3 (Fig. 2.7) shows a slight correlation.
Like the map above, Fig. 2.2 shows the number of crimes per 100,000 residents per district in 2013–2022, but the line plot makes the development over time easier to interpret.
Fig. 2.2 Number of crimes per 100,000 residents in the years 2013–2022 by district.#
2.2.2. Fire Crimes, Car Theft, and Burglary#
In the following section we investigate the number of fire crimes, car theft and burglary per 100,000 residents by district per year.
The goal for the temporal development of crime rates was not only to look at the overall crime rates, but also to show the trend for various types of crimes to identify noteworthy differences. However, creating and comparing 20 individual plots would have been quite a complex task.
To address this, the mean value of each crime feature for each district over the last 10 years was computed. Subsequently, a Principal Component Analysis (PCA) was conducted with these mean values, leading to the creation of a biplot. The biplot revealed that most crime features showed similarities, while car theft, fire crimes and burglary appeared to be distinct from the others, making them especially interesting for further investigation.
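The aggregation and PCA described above could look roughly like the following sketch. The code for this step is not shown in the report, so everything here is an assumption made for illustration: the data are random, the feature names are invented, and the PCA is computed with a plain singular value decomposition rather than with the routine the authors used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
districts = [f"District {i}" for i in range(1, 13)]   # twelve invented districts
years = range(2013, 2023)                             # ten years
features = ["Feature A", "Feature B", "Feature C"]    # invented crime features

# Long table with one row per district and year (random toy values)
rows = [
    {"District": d, "Year": y, **dict(zip(features, rng.normal(size=len(features))))}
    for d in districts
    for y in years
]
long_df = pd.DataFrame(rows)

# 1) Mean of each feature per district over the ten years
means = long_df.groupby("District")[features].mean()

# 2) Standardize the features so that each has mean 0 and unit variance
z = (means - means.mean()) / means.std(ddof=1)

# 3) PCA via singular value decomposition of the standardized matrix
u, s, vt = np.linalg.svd(z.to_numpy(), full_matrices=False)
scores = u * s                         # district coordinates (points of a biplot)
loadings = vt.T                        # feature directions (arrows of a biplot)
explained = s**2 / np.sum(s**2)        # share of variance per component

print(pd.DataFrame(loadings[:, :2], index=features, columns=["PC1", "PC2"]))
```

In a biplot built from `scores` and `loadings`, features whose arrows point in a similar direction behave similarly across districts, while arrows pointing elsewhere mark the features that stand out.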
The plot for fire crimes is shown in Fig. 2.3. We cannot make out a clear pattern.
Show code cell source
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
from IPython import display
import time
import warnings
warnings.filterwarnings("ignore")
set1 = sns.color_palette("Set1")
tab10 = sns.color_palette("tab10")
qualitative_palette = sns.color_palette([set1[0], "#000099", set1[4], "#ff66ff", "#006600", "#ffcc00", "#000000", tab10[9], tab10[5], set1[8], "#00cc00", "#660066"])
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap
import warnings
warnings.filterwarnings("ignore")
bezirkskriminalitaet = pd.read_pickle("../data/08_kriminalitaet_auf_bezirksebene_2013_bis_2022.pkl")
bezirkskriminalitaet['BEZ'] = bezirkskriminalitaet['LOR-Schlüssel (Bezirksregion)'].astype(str).str[:2]
bezirkskriminalitaet = bezirkskriminalitaet.rename(columns={"Straftaten \n-insgesamt-": "Overall crime rate",
"Branddelikte \n-insgesamt-": "Fire crimes",
"Diebstahl von Kraftwagen": "Car theft",
"Wohnraum-\neinbruch": "Burglary"})
# Create a list of features to iterate over
features_list = ["Overall crime rate", "Fire crimes", "Car theft", "Burglary"]
# Create the categorical color palette with unique colors for each category
categories = bezirkskriminalitaet["Bezeichnung (Bezirksregion)"].unique()
num_categories = len(categories)
# Create the subplots (4 rows, 1 column)
fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(10, 32), sharex=True, gridspec_kw={'top': 0.95})
# Iterate over the features and plot each on its respective subplot
for idx, feature in enumerate(features_list):
    # Convert the current feature column to integers
    bezirkskriminalitaet[feature] = bezirkskriminalitaet[feature].astype(int)
    # Group the data by "Bezeichnung (Bezirksregion)" and "Year" and sum the values for the current feature
    grouped_data = bezirkskriminalitaet.groupby(["Bezeichnung (Bezirksregion)", "Year"])[feature].sum().reset_index()
    # Select the current subplot
    ax = axes[idx]
    # Iterate over the unique values in the "Bezeichnung (Bezirksregion)" column
    for category, color in zip(categories, qualitative_palette):
        # Filter the data for the current category
        category_data = grouped_data[grouped_data["Bezeichnung (Bezirksregion)"] == category]
        # Plot the line for the current category on the current subplot
        ax.plot(category_data["Year"], category_data[feature], label=category, color=color, linewidth=3)
    # Set the subplot title, labels, and legend
    ax.set_title(f"{feature}", fontsize=15)
    ax.set_ylabel(f"{feature} per 100,000 residents", fontsize=12)
    ax.legend(title="Districts", loc="upper right", fontsize=8)
    # Get all unique years in the data for the current subplot
    unique_years = bezirkskriminalitaet["Year"].unique()
    # Set the x-ticks to be all unique years for the current subplot
    ax.set_xticks(unique_years)
    ax.tick_params(axis='both', which='both', labelbottom=True)
    # Disable scientific notation on the y-axis
    ax.ticklabel_format(style='plain')
    # Set the background color to gray
    ax.set_facecolor('lightgray')
    # Add a grid
    ax.grid(color='white', linestyle='-', linewidth=0.5)
plt.suptitle('Temporal development (2013-2022) of numbers per 100,000 residents \nof different kinds of crimes split by Berlin districts', fontsize=20)
# Adjust the layout
plt.tight_layout()
# Show the plot
plt.show()
Fig. 2.3 Number of fire crimes per 100,000 residents in the years 2013–2022 by district.#
The numbers for each district display significant variations. For instance, Treptow-Köpenick had the third-highest number of fire crimes in 2019 but the lowest in 2020, indicating a weak or non-existent correlation between fire crimes and other indicators.
Looking at the car thefts in Fig. 2.4, we see significant fluctuations from year to year.
Fig. 2.4 Number of car thefts per 100,000 residents in the years 2013–2022 by district.#
These fluctuations differ from the overall crime rate in Fig. 2.2. Notably, Lichtenberg, Marzahn-Hellersdorf, Treptow-Köpenick, and Charlottenburg-Wilmersdorf have the highest car theft numbers, despite being among the lowest for overall crime (except for Charlottenburg-Wilmersdorf).
A possible interpretation is that these districts are perceived as safer, with residents having higher incomes and possessing more valuable cars. This makes them attractive targets for car thieves.
Additionally, the car theft rate is notably low in all districts during 2020 and 2021, the years with the heaviest Covid restrictions, and increases again in 2022.
The number of burglaries is shown in Fig. 2.5. The numbers for all districts appear relatively similar, particularly since 2016.
Fig. 2.5 Number of burglaries per 100,000 residents in the years 2013–2022 by district.#
Notably, Steglitz-Zehlendorf, known for having the lowest overall crime rate, exhibits a relatively high burglary rate, along with Charlottenburg-Wilmersdorf. The most remarkable aspect of this plot is the general declining trend in burglary numbers over the years.
2.3. Correlations Between Selected Social Aspects#
We select the following six features from our preprocessed data, based on measurements from 2021 and 2022:
Population density (residents per km²)
Overall crime rate (crimes per 100,000 residents)
Number of burdens (environmental justice, explained in Section 2.1)
Unemployment rate (in percent)
Social benefits receiver rate (in percent)
Child poverty rate (in percent)
2.3.1. Feature Selection: PCA#
To get an idea about which features might be correlated, we perform a Principal Component Analysis (PCA). The result is presented in Fig. 2.6 as a biplot of the first two principal components.
Show code cell source
# Change the directory to where the data is stored
setwd("/Users/robert/Documents/Master Data Science/2. Semester/Data Visualization/PROJECT/VisuProj23")
df <- read.csv("data/07_kernfeatures.csv")
df$LOR_str <- sprintf("%08d", df$LOR_str)
df[273,4] = "Schloßstraße Stegl."
df[407,4] = "Schloßstraße Ch'burg"
row.names(df) <- df$PLR_NAME
dfnew <- df[, -c(1:4)]
colnames(dfnew) <- c("Unemployment", "Social benefits", "Child poverty", "Pop. density", "Burdens", "Crimes")
# Perform PCA
one_prcomp <- prcomp(dfnew, scale. = TRUE)
PCs_prcomp <- one_prcomp$x
PCvariances <- one_prcomp$sdev^2
# Create a biplot
biplot(one_prcomp, pc.biplot=TRUE,
xlab=paste0("PC1, variance ", round(PCvariances[1]/6*100, 2),"%"),
ylab=paste0("PC2, variance ", round(PCvariances[2]/6*100, 2),"%"),
main="Biplot of first two principal components for key features",
xlim=c(-4,4), ylim=c(-4,4), cex=c(0.6,0.9),
expand=1.2, asp=1, bg = "darkgrey", panel.first = grid())
symbols(0, 0, circles = 1, inches=FALSE, add=TRUE, fg="red")
Fig. 2.6 Biplot of the first two principal components. Features are population density, overall crime rate, number of burdens, unemployment rate, social benefits receiver rate, and child poverty rate.#
The biplot hints at two possible groups of correlated features: child poverty, social benefits, and unemployment rate on the one hand, and overall crime rate, number of burdens, and population density on the other.
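The axis labels of the biplot report the share of variance per component as the component variance divided by 6. The short sketch below illustrates why this works for standardized features: the total variance then equals the number of features, so each eigenvalue divided by that number is the explained share. The data are random and only serve as an illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 6                       # 200 invented observations, six features
x = rng.normal(size=(n, p))
x[:, 1] += x[:, 0]                  # make two features correlated

# Standardize every feature (mean 0, sample standard deviation 1)
z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

# The covariance matrix of z is the correlation matrix of x
cov = np.cov(z, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]   # sorted in descending order

share = eigenvalues / p             # explained share per principal component
print(np.round(share * 100, 2))     # percentages that sum to 100
```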
2.3.2. Child Poverty, Social Benefits, and Unemployment Rate#
As hinted by the PCA, the following interactive scatter plot clearly shows a positive correlation between child poverty, social benefits, and unemployment rate:
Show code cell source
import pandas as pd
import numpy as np
import geopandas as gpd
# import matplotlib.pyplot as plt
# import matplotlib.patches as mpatches
# from mpl_toolkits.mplot3d import Axes3D
import plotly.graph_objects as go
import plotly.express as px
import seaborn as sns
import pickle
from matplotlib.lines import Line2D
from IPython import display
# import time
# fix for plotly jupyter book display https://github.com/executablebooks/jupyter-book/issues/2041
# import plotly.io as pio
# pio.renderers.default = "notebook_connected"
# pio.renderers.default = "notebook"
complete_df = pd.read_pickle("../data/07_complete_df.pkl")
# Normalize the 'Einwohnerstand' column
normalized_einwohnerstand = (complete_df['Einwohnerstand'] - complete_df['Einwohnerstand'].min()) / (complete_df['Einwohnerstand'].max() - complete_df['Einwohnerstand'].min())
# Rename the columns of most important features
# Each source column is listed once; duplicate dictionary keys would silently override each other
complete_df = complete_df.rename(columns={
    "PLR_NAME": "Area",
    "Einwohnerdichte pro qkm": "Population density (residents per square km)",
    "Anzahl_Mehrfachbelastungen": "Number of burdens",
    "Straftaten_gesamt": "Crimes per 100.000 residents",
    "Kinderarmut": "Percentage of children living in poverty",
    "Arbeitslosigkeit": "Percentage of unemployed people",
    "Transferbezug": "Percentage of people receiving social benefits",
    "Lärmbelastungs_Kategorie": "Noise pollution category",
    "Luftbelastungs_Kategorie": "Air pollution category",
    "Thermische_Belastungs_Kategorie": "Thermal pollution category",
    "Grünflächenversorgungs_Kategorie": "Green area supply category",
    "Soziale_Benachteiligungs_Kategorie": "Social disadvantage category",
})
# Create 3D scatter plot for "Kinderarmut", "Transferbezug", and "Arbeitslosigkeit"
fig = go.Figure(data=[go.Scatter3d(
x=complete_df['Percentage of unemployed people'],
y=complete_df['Percentage of people receiving social benefits'],
z=complete_df['Percentage of children living in poverty'],
mode='markers',
marker=dict(size=6, opacity=0.7, color='blue')
)])
fig.update_layout(
# title=dict(text='Child Poverty vs. Social Benefits vs. Unemployment', font=dict(size=25, color='grey')),
scene=dict(
xaxis_title='Unemployment',
yaxis_title='Social benefits',
zaxis_title='Child poverty',
xaxis=dict(title_font=dict(size=20, color='grey'), tickfont=dict(size=10, color='grey')),
yaxis=dict(title_font=dict(size=20, color='grey'), tickfont=dict(size=10, color='grey')),
zaxis=dict(title_font=dict(size=20, color='grey'), tickfont=dict(size=10, color='grey')),
),
width=800,
height=600,
margin=dict(l=0, r=0, b=0, t=60),
paper_bgcolor='rgba(0,0,0,0)', # Set the background color (transparent)
plot_bgcolor='rgba(0,0,0,0)', # Set the plot color (transparent)
)
fig.update_yaxes(title_font_color="red")
fig.show()
This was to be expected, as these features are obviously highly interconnected.
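A visual impression like this can be backed up with pairwise Pearson correlation coefficients. The sketch below uses randomly generated toy data with a shared latent factor, so the column names and numbers are assumptions and not results from the Berlin data. With the real data, the same call could be applied to the three renamed percentage columns of `complete_df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
base = rng.normal(size=n)  # shared latent factor (invented)

# Three strongly related toy features and one unrelated feature
toy = pd.DataFrame({
    "Unemployment": base + 0.3 * rng.normal(size=n),
    "Social benefits": base + 0.3 * rng.normal(size=n),
    "Child poverty": base + 0.3 * rng.normal(size=n),
    "Unrelated": rng.normal(size=n),
})

# Pairwise Pearson correlation coefficients
corr = toy.corr(method="pearson")
print(corr.round(2))
```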
2.3.3. Crimes, Burdens, and Population Density#
The following plot shows the overall crime rate, the number of burdens, and the population density:
Show code cell source
import pandas as pd
import numpy as np
import geopandas as gpd
# import matplotlib.pyplot as plt
# import matplotlib.patches as mpatches
# from mpl_toolkits.mplot3d import Axes3D
import plotly.graph_objects as go
import plotly.express as px
import seaborn as sns
import pickle
from matplotlib.lines import Line2D
from IPython import display
# import time
# fix for plotly jupyter book display https://github.com/executablebooks/jupyter-book/issues/2041
# import plotly.io as pio
# pio.renderers.default = "notebook_connected"
# pio.renderers.default = "notebook"
complete_df = pd.read_pickle("../data/07_complete_df.pkl")
# Normalize the 'Einwohnerstand' column
normalized_einwohnerstand = (complete_df['Einwohnerstand'] - complete_df['Einwohnerstand'].min()) / (complete_df['Einwohnerstand'].max() - complete_df['Einwohnerstand'].min())
# Rename the columns of most important features
# Each source column is listed once; duplicate dictionary keys would silently override each other
complete_df = complete_df.rename(columns={
    "PLR_NAME": "Area",
    "Einwohnerdichte pro qkm": "Population density (residents per square km)",
    "Anzahl_Mehrfachbelastungen": "Number of burdens",
    "Straftaten_gesamt": "Crimes per 100.000 residents",
    "Kinderarmut": "Percentage of children living in poverty",
    "Arbeitslosigkeit": "Percentage of unemployed people",
    "Transferbezug": "Percentage of people receiving social benefits",
    "Lärmbelastungs_Kategorie": "Noise pollution category",
    "Luftbelastungs_Kategorie": "Air pollution category",
    "Thermische_Belastungs_Kategorie": "Thermal pollution category",
    "Grünflächenversorgungs_Kategorie": "Green area supply category",
    "Soziale_Benachteiligungs_Kategorie": "Social disadvantage category",
})
# Create 3D scatter plot for "Einwohnerdichte pro qkm", "Straftaten_gesamt", and "Anzahl_Mehrfachbelastungen"
fig = go.Figure(data=[go.Scatter3d(
x=complete_df['Population density (residents per square km)'],
y=complete_df['Number of burdens'],
z=complete_df['Crimes per 100.000 residents'],
mode='markers',
marker=dict(size=6, opacity=0.7, color='blue')
)])
fig.update_layout(
# title=dict(text='Crimes vs. burdens vs. pop. density', font=dict(size=25, color='grey')),
scene=dict(
xaxis_title='Pop. density',
yaxis_title='Burdens',
zaxis_title='Crimes',
xaxis=dict(title_font=dict(size=20, color='grey'), tickfont=dict(size=10, color='grey')),
yaxis=dict(title_font=dict(size=20, color='grey'), tickfont=dict(size=10, color='grey')),
zaxis=dict(title_font=dict(size=20, color='grey'), tickfont=dict(size=10, color='grey')),
),
width=800,
height=600,
margin=dict(l=0, r=0, b=0, t=60),
paper_bgcolor='rgba(0,0,0,0)', # Set the background color (transparent)
plot_bgcolor='rgba(0,0,0,0)', # Set the plot color (transparent)
)
fig.show()
From this scatter plot, we cannot immediately make out the clear correlation that we had expected from the biplot in Fig. 2.6, so we investigate the pairwise relationships further.
Looking at the scatter plot of crimes vs. population density in Fig. 2.7, we can only make out a very small positive correlation.
Show code cell source
import matplotlib.pyplot as plt
import seaborn as sns
# Create a 1x3 grid of subplots
fig, axs = plt.subplots(1, 3, figsize=(18, 6))
# First Plot
x_feature = "Population density (residents per square km)"
y_feature = "Crimes per 100.000 residents"
ax = complete_df.plot.scatter(
x=x_feature,
y=y_feature,
alpha=0.6,
s=normalized_einwohnerstand * 80,
ax=axs[0] # Add to the first subplot
)
sns.regplot(
x=x_feature,
y=y_feature,
data=complete_df,
scatter=False,
color='r',
order=1,
ax=ax
)
# ax.set_title("Number of crimes vs. population density", fontsize=14)
ax.set_xlabel("Population density (1/km²)")
ax.set_ylabel("Crimes per 100,000 residents")
ax.set_facecolor('lightgray')
ax.grid(color='white', linestyle='-', linewidth=0.5)
ax.tick_params(colors='black', labelcolor='black')
for spine in ax.spines.values():
    spine.set_edgecolor('white')
plt.gcf().set_dpi(150)
# Second Plot
sns.set_style("darkgrid")
sns.boxplot(x="Number of burdens", y="Crimes per 100.000 residents", data=complete_df, flierprops={'marker': 'o', 'markersize': 5}, color="lightblue", ax=axs[1]) # Add to the second subplot
axs[1].set_title("Number of crimes vs. number of burdens", fontsize=14)
axs[1].set_xlabel("Number of burdens")
axs[1].set_ylabel("Crimes per 100,000 residents")
axs[1].set_facecolor('lightgray')
axs[1].grid(color='white', linestyle='-', linewidth=0.5)
# Third Plot
sns.set_style("darkgrid")
sns.boxplot(y="Population density (residents per square km)", x="Number of burdens", data=complete_df, flierprops={'marker': 'o', 'markersize': 5}, color="lightblue", ax=axs[2]) # Add to the third subplot
axs[2].set_title("Population density vs. number of burdens", fontsize=14)
axs[2].set_xlabel("Number of burdens")
axs[2].set_ylabel("Population density (1/km²)")
#axs[2].invert_yaxis()
axs[2].set_facecolor('lightgray')
axs[2].grid(color='white', linestyle='-', linewidth=0.5)
# Adjust layout
plt.tight_layout()
#plt.savefig("subplots.png", dpi=150, bbox_inches='tight')
# Show the plots
plt.show()
Fig. 2.7 Crimes per 100,000 residents as a function of population density.#
The boxplots in Fig. 2.8 enable us to clearly recognize a positive correlation between crimes and burdens. The positive correlation between burdens and population density is harder to grasp.
Fig. 2.8 Left: crimes per 100,000 residents vs. the number of burdens. Right: Population density vs. the number of burdens.#
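Because the number of burdens is an ordinal count from 0 to 5, a rank-based measure is one way to quantify what the boxplots show. The following sketch computes group medians and a Spearman rank correlation (as the Pearson correlation of the ranks) on invented toy data. The numbers, the strength of the relationship, and the short column name `Crimes` are assumptions and not results from the Berlin data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 400
burdens = rng.integers(0, 6, size=n)  # invented burden counts 0..5
# Invented crime rates that rise with the burden count, plus noise
crimes = 8000 + 1500 * burdens + rng.normal(scale=3000, size=n)
toy = pd.DataFrame({"Number of burdens": burdens, "Crimes": crimes})

# Median crime rate per burden count (the centre lines of a boxplot)
medians = toy.groupby("Number of burdens")["Crimes"].median()

# Spearman rank correlation, computed as the Pearson correlation of the ranks
spearman = toy.rank().corr(method="pearson").loc["Number of burdens", "Crimes"]

print(medians.round(0))
print(round(spearman, 2))
```

A clearly positive coefficient together with rising medians would support the reading of the left boxplot, while a coefficient close to zero would match a relationship that is hard to grasp visually.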
2.3.4. Self-Organizing Map#
Lastly, we want to get an overview of the distribution of the features selected in Section 2.3 (population density, overall crime rate, number of burdens (environmental justice), unemployment rate, social benefits receiver rate, child poverty rate). The best way for us to do so without losing too much information is to use a so-called Self-Organizing Map (SOM). The SOM is depicted in Fig. 2.9.
Show code cell source
rm(list = ls(all.names = TRUE))
data(swiss)
library(kohonen)
library(RColorBrewer)
setwd("/Users/robert/Documents/Master Data Science/2. Semester/Data Visualization/PROJECT/VisuProj23")
# Load the CSV file into a DataFrame
df <- read.csv("data/07_wichtigste_features_aus_04_bis_06.csv")
df$LOR_str <- sprintf("%08d", df$LOR_str)
df[273,3] = "Schloßstraße Stegl."
df[407,3] = "Schloßstraße Ch'burg"
row.names(df) <- df$PLR_NAME
df <- df[, -c(1:4)]
old <- cur <- Inf
dat <- scale(df)
## iterative improvement of SOM
for (i in 1:100){
  erg <- som(dat, grid=somgrid(15,10,"hexagonal"), rlen=1000)
  cur <- sum(erg$distances)
  if (cur<old){
    erg2 <- erg
    old <- cur
  }
}
som2pts <- function(x){
  stopifnot("kohonen" %in% class(x))
  x$grid$pts[x$unit.classif,]
}
som_out <- som2pts(erg2)
pal <- function(n) brewer.pal(n, "Set3")
par(lab.cex = 1) # Adjust the legend font size
par(mfrow=c(1,1))
plot(erg2, shape="straight", palette.name=pal)
Fig. 2.9 Self-Organizing Map with the features population density, overall crime rate, number of burdens (environmental justice), unemployment rate, social benefits receiver rate, and child poverty rate.#
Overall, high values for all features appear on the right and low values on the left. Some things are remarkable: almost everywhere in the codes plot, we can clearly see the high correlation between unemployment, social benefits, and child poverty, as the slices for these three features are always of similar size.
In the middle of the last row are examples showing that a high population density and a high number of burdens do not necessarily lead to high numbers of crimes, unemployment, social benefits, or child poverty. In row six, column four, we see the opposite example, where all features have low values except for the crime rate, which is relatively high. Many more insights could be gained from the SOM plot, but mentioning all of them would go beyond the scope of this project.
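To make the idea behind the SOM more tangible, the following sketch trains a very small map with NumPy and keeps the best of several random restarts, similar in spirit to the repeated training in the R cell. It is not the algorithm of the kohonen package: the rectangular 4x5 grid, the learning-rate and neighbourhood schedules, the number of iterations, and the random toy data are all assumptions chosen for brevity.

```python
import numpy as np


def train_som(data, rows=4, cols=5, n_iter=2000, seed=0):
    """Train a small rectangular SOM with online updates and return its codebook."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    # Grid coordinates of every unit, shape (rows * cols, 2)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise the codebook vectors with randomly chosen observations
    codebook = data[rng.integers(0, n, size=rows * cols)].copy()
    sigma0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = t / n_iter
        lr = 0.5 * (1.0 - frac)                      # decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 0.5 * frac   # shrinking neighbourhood radius
        x = data[rng.integers(0, n)]                 # one random observation
        # Best matching unit: the codebook vector closest to x
        bmu = np.argmin(((codebook - x) ** 2).sum(axis=1))
        # Gaussian neighbourhood around the best matching unit on the grid
        grid_dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-grid_dist2 / (2.0 * sigma**2))
        # Pull the best matching unit and its neighbours towards x
        codebook += lr * h[:, None] * (x - codebook)
    return codebook


def quantization_error(data, codebook):
    """Sum of distances between each observation and its closest codebook vector."""
    d = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).sum()


# Invented, standardized toy data with six features
rng = np.random.default_rng(0)
raw = rng.normal(size=(300, 6))
data = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# Keep the map with the smallest quantization error over several restarts
best_codebook, best_error = None, np.inf
errors = []
for seed in range(5):
    codebook = train_som(data, seed=seed)
    error = quantization_error(data, codebook)
    errors.append(error)
    if error < best_error:
        best_codebook, best_error = codebook, error

print(best_codebook.shape)
```

Each row of `best_codebook` corresponds to one cell of the map. A codes plot such as Fig. 2.9 draws the six entries of every row as the slices of that cell.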
3. Conclusion#
We investigated various datasets on social aspects of Berlin using several visualization techniques, concentrating on crime and other selected features. For the former, we investigated the temporal development of different kinds of crimes in the twelve districts of Berlin over ten years.
It became clear that the almost constant curves of the districts' overall crime rates look significantly different when we consider single crime features that, as suggested by a PCA biplot, differ from the rest of the crime features. The main insight is that even in districts where crime rates are generally very low, the numbers for single kinds of crimes may still be high. Car theft is one example: it apparently occurs more often in districts with a low overall crime rate. A possible explanation is that these districts tend to have wealthier residents with more expensive cars, which makes them a more popular target for car thieves.
Generally speaking, and this leads us to our second focus point, crime rates tend to increase with the number of burdens and with population density, but this correlation is low. The kind of crime and the area seem to have a stronger influence on the numbers. Unemployment, social benefits, and child poverty, on the other hand, are very strongly correlated, as expected. For the other three features, where the correlation is weaker, the SOM plot also shows that exceptions exist, sometimes even in relatively large numbers (for example, high values for burdens and population density combined with low values for crimes and the other features).
Our expectation was to find high or even very high correlations between all the examined features, but we only found them between unemployment, social benefits, and child poverty. So while we could confirm some stereotypes about social correlations, most of them were only confirmed with reservations, some of them considerable.
Building all the datasets for the visualizations from scratch was a big part of our work for this project. Even though it was fun and interesting, if we had to do the whole project again, we would look for a preprocessed dataset to use for our visualizations. We would also scrap certain time-consuming ideas, such as implementing a slider for our GIF animations, much sooner.
4. References#
[1] Berlin Open Data. URL: https://daten.berlin.de/ (visited on 2023-07-23).
[2] Kriminalitätsatlas Berlin. URL: https://daten.berlin.de/datensaetze/kriminalit%C3%A4tsatlas-berlin (visited on 2023-07-11).
[3] Umweltgerechtigkeit: Integrierte Mehrfachbelastungskarte Umwelt und Soziale Benachteiligung 2021/2022 (Umweltatlas). URL: https://daten.berlin.de/datensaetze/umweltgerechtigkeit-integrierte-mehrfachbelastungskarte-umwelt-und-soziale (visited on 2023-06-22).
[4] Bericht Monitoring Soziale Stadtentwicklung Berlin 2021. URL: https://www.berlin.de/sen/sbw/stadtdaten/stadtwissen/monitoring-soziale-stadtentwicklung/bericht-2021/ (visited on 2023-08-08).
[5] LOR Planungsräume 2021 und Metadaten. URL: https://daten.odis-berlin.de/de/dataset/lor_planungsgraeume_2021/ (visited on 2023-08-08).
[6] Umweltgerechtigkeit Berlin. URL: https://www.berlin.de/umweltatlas/mensch/umweltgerechtigkeit/ (visited on 2023-06-22).